
<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   
      <title>18.3.&nbsp;优化正则表达式</title>
      <link rel="stylesheet" href="../diveintopython.css" type="text/css">
      <link rev="made" href="mailto:f8dy@diveintopython.org">
      <meta name="generator" content="DocBook XSL Stylesheets V1.52.2">
      <meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free">
      <meta name="description" content="Python from novice to pro">
      <link rel="home" href="../toc/index.html" title="Dive Into Python">
      <link rel="up" href="index.html" title="第&nbsp;18&nbsp;章&nbsp;性能优化">
      <link rel="previous" href="timeit.html" title="18.2.&nbsp;使用 timeit 模块">
      <link rel="next" href="dictionary_lookups.html" title="18.4.&nbsp;优化字典查找">
   </head>
   <body>
      <table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
         <tr>
            <td id="breadcrumb" colspan="5" align="left" valign="top">导航：<a href="../index.html">起始页</a>&nbsp;&gt;&nbsp;<a href="../toc/index.html">Dive Into Python</a>&nbsp;&gt;&nbsp;<a href="index.html">性能优化</a>&nbsp;&gt;&nbsp;<span class="thispage">优化正则表达式</span></td>
            <td id="navigation" align="right" valign="top">&nbsp;&nbsp;&nbsp;<a href="timeit.html" title="上一页: “使用 timeit 模块”">&lt;&lt;</a>&nbsp;&nbsp;&nbsp;<a href="dictionary_lookups.html" title="下一页: “优化字典查找”">&gt;&gt;</a></td>
         </tr>
         <tr>
            <td colspan="3" id="logocontainer">
               <h1 id="logo"><a href="../index.html" accesskey="1">深入 Python :Dive Into Python 中文版</a></h1>
               <p id="tagline">Python 从新手到专家 [Dip_5.4b_CPyUG_Release]</p>
            </td>
            <td colspan="3" align="right">
               <form id="search" method="GET" action="http://www.google.com/custom">
                  <p><label for="q" accesskey="4">Find:&nbsp;</label><input type="text" id="q" name="q" size="20" maxlength="255" value=""> <input type="submit" value="搜索"><input type="hidden" name="domains" value="woodpecker.org.cn/diveintopython"><input type="hidden" name="sitesearch" value="www.woodpecker.org.cn/diveintopython"></p>
               </form>
            </td>
         </tr>
      </table>
      <!--#include virtual="/inc/ads" -->
      <div class="section" lang="zh_cn">
         <div class="titlepage">
            <div>
               <div>
                  <h2 class="title"><a name="soundex.stage1"></a>18.3.&nbsp;优化正则表达式
                  </h2>
               </div>
            </div>
            <div></div>
         </div>
         <div class="abstract">
            <p> Soundex 函数的第一件事是检查输入是否是一个空字符串。怎样做是最好的方法？</p>
         </div>
         <p>如果你回答 “<span class="quote">正则表达式</span>”，坐在角落里反省你糟糕的直觉。正则表达式几乎永远不是最好的答案，而且应该被尽可能避开。这不仅仅是基于性能考虑，而是因为调试和维护都很困难，当然性能也是个原因。
         </p>
         <p>这是 <tt class="filename">soundex/stage1/soundex1a.py</tt> 检查 <tt class="varname">source</tt> 是否全部由字母构成的一段代码，至少是一个字母 (而不是空字符串)：
         </p>
         <div class="informalexample"><pre class="programlisting">
    allChars = string.uppercase + string.lowercase
    <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> re.search(<span class='pystring'>'^[%s]+$'</span> % allChars, source):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p><tt class="filename">soundex1a.py</tt> 表现如何？为了方便，<tt class="literal">__main__</tt> 部分包含了一段代码：调用 <tt class="filename">timeit</tt> 模块，为三个不同名字分别建立测试，依次测试，并显示每个测试的最短耗时：
         </p>
         <div class="informalexample"><pre class="programlisting"><span class='pykeyword'>
if</span> __name__ == <span class='pystring'>'__main__'</span>:
    <span class='pykeyword'>from</span> timeit <span class='pykeyword'>import</span> Timer
    names = (<span class='pystring'>'Woo'</span>, <span class='pystring'>'Pilgrim'</span>, <span class='pystring'>'Flingjingwaller'</span>)
    <span class='pykeyword'>for</span> name <span class='pykeyword'>in</span> names:
        statement = <span class='pystring'>"soundex('%s')"</span> % name
        t = Timer(statement, <span class='pystring'>"from __main__ import soundex"</span>)
        <span class='pykeyword'>print</span> name.ljust(15), soundex(name), min(t.repeat())
</pre></div>
         <p>那么，应用正则表达式的 <tt class="filename">soundex1a.py</tt> 表现如何呢？
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1a.py</span>
<span class="computeroutput">Woo             W000 19.3356647283
Pilgrim         P426 24.0772053431
Flingjingwaller F452 35.0463220884</span>
</pre></div>
         <p>正如你预料，名字越长，算法耗时就越长。有几个工作可以令我们减小这个差距 (使函数对于长输入花费较短的相对时间) 但是算法的本质决定它不可能每次运行时间都相同。</p>
         <p>另一点应铭记于心的是，我们测试的是有代表性的名字样本。<tt class="literal">Woo</tt> 是个被缩短到单字符并补零的小样本；<tt class="literal">Pilgrim</tt> 是个夹带着特别字符和忽略字符的平均长度的正常样本；<tt class="literal">Flingjingwaller</tt> 是一个包含连续重复字符并且特别长的样本。其它的测试可能同样有帮助，但它们已经很好地代表了不同的样本范围。
         </p>
         <p>那么那个正则表达式如何呢？嗯，缺乏效率。因为这个表达式测试不止一个范围的字符 (<tt class="literal">A-Z</tt> 的大写范围和 <tt class="literal">a-z</tt> 的小写字母范围)，我们可以使用一个正则表达式的缩写语法。这便是 <tt class="filename">soundex/stage1/soundex1b.py</tt>:
         </p>
         <div class="informalexample"><pre class="programlisting">
    <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> re.search(<span class='pystring'>'^[A-Za-z]+$'</span>, source):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p><tt class="filename">timeit</tt> 显示 <tt class="filename">soundex1b.py</tt> 比 <tt class="filename">soundex1a.py</tt> 稍微快一些，但是没什么令人激动的变化：
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1b.py</span>
<span class="computeroutput">Woo             W000 17.1361133887
Pilgrim         P426 21.8201693232
Flingjingwaller F452 32.7262294509</span>
</pre></div>
         <p>在 <a href="../refactoring/refactoring.html" title="15.3.&nbsp;重构">第&nbsp;15.3&nbsp;节 “重构”</a> 中我们看到正则表达式可以被编译并在重用时以更快速度获得结果。因为这个正则表达式在函数中每次被调用时都不变化，我们可以编译它一次并使用被编译的版本。这便是 <tt class="filename">soundex/stage1/soundex1c.py</tt>：
         </p>
         <div class="informalexample"><pre class="programlisting">
isOnlyChars = re.compile(<span class='pystring'>'^[A-Za-z]+$'</span>).search
<span class='pykeyword'>def</span><span class='pyclass'> soundex</span>(source):
    <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> isOnlyChars(source):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p><tt class="filename">soundex1c.py</tt> 中使用被编译的正则表达式产生了显著的提速：
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1c.py</span>
<span class="computeroutput">Woo             W000 14.5348347346
Pilgrim         P426 19.2784703084
Flingjingwaller F452 30.0893873383</span>
</pre></div>
         <p>但是这样的优化是正路吗？这里的逻辑很简单：输入 <tt class="varname">source</tt> 应该是非空，并且需要完全由字母构成。如果编写一个循环查看每个字符并且抛弃正则表达式，是否会更快些？
         </p>
         <p>这便是 <tt class="filename">soundex/stage1/soundex1d.py</tt>：
         </p>
         <div class="informalexample"><pre class="programlisting">
    <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> source:
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
    <span class='pykeyword'>for</span> c <span class='pykeyword'>in</span> source:
        <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> (<span class='pystring'>'A'</span> &lt;= c &lt;= <span class='pystring'>'Z'</span>) <span class='pykeyword'>and</span> <span class='pykeyword'>not</span> (<span class='pystring'>'a'</span> &lt;= c &lt;= <span class='pystring'>'z'</span>):
            <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p>这个技术在 <tt class="filename">soundex1d.py</tt> 中恰好<span class="emphasis"><em>不及</em></span> 编译后的正则表达式快 (尽管比使用未编译的正则表达式快<sup>[<a name="d0e39400" href="#ftn.d0e39400">14</a>]</sup>)：
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1d.py</span>
<span class="computeroutput">Woo             W000 15.4065058548
Pilgrim         P426 22.2753567842
Flingjingwaller F452 37.5845122774</span>
</pre></div>
         <p>为什么 <tt class="filename">soundex1d.py</tt> 没能更快？答案来自 <span class="application">Python</span> 的编译本质。正则表达式引擎以 C 语言编写，被编译后则能本能地在你的计算机上运行。另一方面，循环是以 <span class="application">Python</span> 编写，要通过 <span class="application">Python</span> 解释器。尽管循环相对简单，但没能简单到补偿花在代码解释上的时间。正则表达式永远不是正确答案……但例外还是存在的。
         </p>
         <p>恰巧 <span class="application">Python</span> 提供了一个晦涩的字符串方法。你有理由不了解它，因为本书未曾提到它。这个方法便是 <tt class="methodname">isalpha()</tt>，它检查一个字符串是否只包含字母。
         </p>
         <p>这便是 <tt class="filename">soundex/stage1/soundex1e.py</tt>：
         </p>
         <div class="informalexample"><pre class="programlisting">
    <span class='pykeyword'>if</span> (<span class='pykeyword'>not</span> source) <span class='pykeyword'>and</span> (<span class='pykeyword'>not</span> source.isalpha()):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p>在 <tt class="filename">soundex1e.py</tt> 中应用这个特殊方法我们能得到多少好处?  很多。
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1e.py</span>
<span class="computeroutput">Woo             W000 13.5069504644
Pilgrim         P426 18.2199394057
Flingjingwaller F452 28.9975225902</span>
</pre></div>
         <div class="example"><a name="d0e39467"></a><h3 class="title">例&nbsp;18.3.&nbsp;目前为止最好的结果：<tt class="filename">soundex/stage1/soundex1e.py</tt></h3><pre class="programlisting"><span class='pykeyword'>
import</span> string, re

charToSoundex = {<span class='pystring'>"A"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"B"</span>: <span class='pystring'>"1"</span>,
                 <span class='pystring'>"C"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"D"</span>: <span class='pystring'>"3"</span>,
                 <span class='pystring'>"E"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"F"</span>: <span class='pystring'>"1"</span>,
                 <span class='pystring'>"G"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"H"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"I"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"J"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"K"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"L"</span>: <span class='pystring'>"4"</span>,
                 <span class='pystring'>"M"</span>: <span class='pystring'>"5"</span>,
                 <span class='pystring'>"N"</span>: <span class='pystring'>"5"</span>,
                 <span class='pystring'>"O"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"P"</span>: <span class='pystring'>"1"</span>,
                 <span class='pystring'>"Q"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"R"</span>: <span class='pystring'>"6"</span>,
                 <span class='pystring'>"S"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"T"</span>: <span class='pystring'>"3"</span>,
                 <span class='pystring'>"U"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"V"</span>: <span class='pystring'>"1"</span>,
                 <span class='pystring'>"W"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"X"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"Y"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"Z"</span>: <span class='pystring'>"2"</span>}

<span class='pykeyword'>def</span><span class='pyclass'> soundex</span>(source):
    <span class='pykeyword'>if</span> (<span class='pykeyword'>not</span> source) <span class='pykeyword'>and</span> (<span class='pykeyword'>not</span> source.isalpha()):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
    source = source[0].upper() + source[1:]
    digits = source[0]
    <span class='pykeyword'>for</span> s <span class='pykeyword'>in</span> source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
    digits2 = digits[0]
    <span class='pykeyword'>for</span> d <span class='pykeyword'>in</span> digits[1:]:
        <span class='pykeyword'>if</span> digits2[-1] != d:
            digits2 += d
    digits3 = re.sub(<span class='pystring'>'9'</span>, <span class='pystring'>''</span>, digits2)
    <span class='pykeyword'>while</span> len(digits3) &lt; 4:
        digits3 += <span class='pystring'>"0"</span>
    <span class='pykeyword'>return</span> digits3[:4]

<span class='pykeyword'>if</span> __name__ == <span class='pystring'>'__main__'</span>:
    <span class='pykeyword'>from</span> timeit <span class='pykeyword'>import</span> Timer
    names = (<span class='pystring'>'Woo'</span>, <span class='pystring'>'Pilgrim'</span>, <span class='pystring'>'Flingjingwaller'</span>)
    <span class='pykeyword'>for</span> name <span class='pykeyword'>in</span> names:
        statement = <span class='pystring'>"soundex('%s')"</span> % name
        t = Timer(statement, <span class='pystring'>"from __main__ import soundex"</span>)
        <span class='pykeyword'>print</span> name.ljust(15), soundex(name), min(t.repeat())
</pre></div>
         <div class="footnotes">
            <h3 class="footnotetitle">Footnotes</h3>
            <div class="footnote">
               <p><sup>[<a name="ftn.d0e39400" href="#d0e39400">14</a>] </sup>注意 <tt class="filename">soundex1d.py</tt> 在后两个测试点上都比 <tt class="filename">soundex1b.py</tt> 慢，这点与作者所说的矛盾。本章另还有多处出现了正文与测试结果矛盾的地方，每个地方都会用译注加以说明。这个 bug 将在下个版本中得到修正。――译注
               </p>
            </div>
         </div>
      </div>
      <table class="Footer" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
         <tr>
            <td width="35%" align="left"><br><a class="NavigationArrow" href="timeit.html">&lt;&lt;&nbsp;使用 timeit 模块</a></td>
            <td width="30%" align="center"><br>&nbsp;<span class="divider">|</span>&nbsp;<a href="index.html#soundex.divein" title="18.1.&nbsp;概览">1</a> <span class="divider">|</span> <a href="timeit.html" title="18.2.&nbsp;使用 timeit 模块">2</a> <span class="divider">|</span> <span class="thispage">3</span> <span class="divider">|</span> <a href="dictionary_lookups.html" title="18.4.&nbsp;优化字典查找">4</a> <span class="divider">|</span> <a href="list_operations.html" title="18.5.&nbsp;优化列表操作">5</a> <span class="divider">|</span> <a href="string_manipulation.html" title="18.6.&nbsp;优化字符串操作">6</a> <span class="divider">|</span> <a href="summary.html" title="18.7.&nbsp;小结">7</a>&nbsp;<span class="divider">|</span>&nbsp;
            </td>
            <td width="35%" align="right"><br><a class="NavigationArrow" href="dictionary_lookups.html">优化字典查找&nbsp;&gt;&gt;</a></td>
         </tr>
         <tr>
            <td colspan="3"><br></td>
         </tr>
      </table>
      <div class="Footer">
         <p class="copyright">Copyright © 2000, 2001, 2002, 2003, 2004 <a href="mailto:mark@diveintopython.org">Mark Pilgrim</a></p>
         <p class="copyright">Copyright © 2001, 2002, 2003, 2004, 2005, 2006, 2007 <a href="mailto:python-cn@googlegroups.com">CPyUG (邮件列表)</a></p>
      </div>
   </body>
</html>